National Student Clearinghouse - Enrollment

Data Prep

Okay, lets check out the National Student Clearinghouse enrollment data. I will first import the entire dataset, clean it up, and then merge it with the master / Southern graduate dataset.


nsc.enrollment.original <- read_csv('/Users/kellyasche/Library/CloudStorage/GoogleDrive-kasche@ruralmn.org/My Drive/Research/FY25-27/FY26/OHE Journey to meaningful employment/Data/SLEDS/NSCEnrollment/NSC_Enrollments.csv') %>%
  mutate(PersonID = as.integer(PersonID),
         OPEID = str_pad(OPEID, width = 8, side = "left", pad = "0"),
         OPEID.6 = str_sub(OPEID, 1, 6)) %>%
  drop_na(PersonID) %>%
  drop_na(OPEID)
  
kable(head(nsc.enrollment.original))
PersonID EnrollmentBeginTimeID EnrollmentEndTimeID OPEID EnrollmentStatus OPEID.6
10025234 20120103 20120126 00235400 H 002354
11364693 20130122 20130517 00238881 F 002388
3090532 20090824 20091218 00238300 F 002383
11127041 20200601 20200819 00238200 L 002382
4465956 20100823 20101217 00299700 F 002997
9191392 20110110 20110513 00553400 L 005534
kable(names(nsc.enrollment.original))
x
PersonID
EnrollmentBeginTimeID
EnrollmentEndTimeID
OPEID
EnrollmentStatus
OPEID.6


The dataset has 3,010,265 rows and 6 columns. Here are the columns and their definitions.

  • EnrollmentBeginTimeID: Begin date for the student’s period of attendance.
  • EnrollmentEndTimeID: End date for the student’s period of attendance.
  • OPEID: Office of Postsecondary Education (OPE)/FICE code of the college that the student attended (Foreign Key to IPEDSCharacteristics).
  • InstitutionName: Name of institution
  • EnrollmentStatus: The last enrollment status reported for the student. This field will have ‘N/A’ or “NULL” if the reporting college has not defined the student’s enrollment status as directory information. Here are the code definitions;
    • F: Full-time
    • Q: Three-quarter time
    • H: Half-time
    • L: Less than half-time
    • A: Leave of absence
    • W: Withdrawn
    • D: Deceased

Essentially, this dataset provides semester-based information - each observation is a semester/single period with the institutional information and beginning and end time of that single period (usually a semester). For example, if an individual attended South Dakota State University, attended in a “typical” fashion (fall and spring semesters), and graduated in 4 years, there would be 8 observations for that PersonID - one observation per semester.

There are a few pieces of information that I want to consolidate from this dataset;

  1. Did the PersonID in the master dataset attend a college - yes or no?
  2. Did the PersonID attend college during high school (PSEO), immediately after graduation or wait?
  3. Did the PersonID attend multiple colleges - yes or no?
  4. What type of college did the PersonID attend first?
  5. Did the PersonID attend a public, private not-for-profit, or private or, for-profit college(s)?
  6. At any point during their college career, did the PersonID attend a college(s) inside or outside of the Southwest Region or outside the EDR of their high school?

I was going to include the length of time that a PersonID attended college, but due to differences in how institutions report enrollment periods, it didn’t seem like it would be a great indicator.

The pieces of information missing in the original dataset is location and type of institution attended.To do this we will need to join the original dataset with an IPEDS dataset using opeid.

The IPEDS data and the NSC data will only match using the first 6 digits of the IPEDS OPEID values. So I will change that in the NSC data. Then I can join.


ipeds.original <- read_csv('/Users/kellyasche/Library/CloudStorage/GoogleDrive-kasche@ruralmn.org/My Drive/Research/FY25-27/FY26/OHE Journey to meaningful employment/Data/SLEDS/IPEDS/IPEDS.csv') 

ipeds.sector <- ipeds.original %>%
  filter(!is.na(OPEID)) %>%
  mutate(OPEID = str_pad(OPEID, width = 8, side = "left", pad = "0"),
         OPEID.6 = str_sub(OPEID, 1, 6)) %>%
  select(Unitid, OPEID.6, City, State, FIPS, GeographicRegion, CountyCode, InstitutionSector, InstitutionName)  %>%
  mutate(CountyCode = str_pad(CountyCode, width = 5, side = "left", pad = "0"))

nsc.enrollment.ipeds<- nsc.enrollment.original %>%
  left_join(ipeds.sector, by = "OPEID.6") %>%
  drop_na(Unitid) %>%
  mutate(EnrollmentStatus = as_factor(EnrollmentStatus))

kable(head(nsc.enrollment.ipeds))
PersonID EnrollmentBeginTimeID EnrollmentEndTimeID OPEID EnrollmentStatus OPEID.6 Unitid City State FIPS GeographicRegion CountyCode InstitutionSector InstitutionName
10025234 20120103 20120126 00235400 H 002354 173665 Saint Paul MN 27 4 27123 2 Hamline University
11364693 20130122 20130517 00238881 F 002388 174233 Duluth MN 27 4 27137 1 University of Minnesota-Duluth
3090532 20090824 20091218 00238300 F 002383 174862 Saint Bonifacius MN 27 4 27019 2 Crown College
11127041 20200601 20200819 00238200 L 002382 174844 Northfield MN 27 4 27131 2 St Olaf College
4465956 20100823 20101217 00299700 F 002997 200332 Fargo ND 38 4 38017 1 North Dakota State University-Main Campus
9191392 20110110 20110513 00553400 L 005534 174756 Saint Cloud MN 27 4 27145 4 St Cloud Technical and Community College
kable(names(nsc.enrollment.ipeds))
x
PersonID
EnrollmentBeginTimeID
EnrollmentEndTimeID
OPEID
EnrollmentStatus
OPEID.6
Unitid
City
State
FIPS
GeographicRegion
CountyCode
InstitutionSector
InstitutionName


After joining these we now have 3,009,843 rows and 14 columns. This is a few less than the original enrollment document due to a few OPEIDs not aligning. But overall, the data now has the institution sector that the individual attended as well as location.

First, we will create a dataset with a column confirming their post-secondary attendance for each unique PersonID in the nsc.enrollment.ipeds.


nsc.enrollment.confirm <- nsc.enrollment.ipeds %>%
  filter(!EnrollmentStatus %in% c("A", "W", "D")) %>%
  distinct(PersonID) %>%
  mutate(attended.ps = "Yes")

kable(head(nsc.enrollment.confirm))
PersonID attended.ps
10025234 Yes
11364693 Yes
3090532 Yes
11127041 Yes
4465956 Yes
9191392 Yes
kable(names(nsc.enrollment.confirm))
x
PersonID
attended.ps


There are 388,959 rows and 2 columns in the dataset. The columns provide the unique PersonID along with a newly created column confirming that they attended post-secondary education.

Next we will extract the first month and year each PersonID attended a post-secondary institution which will help us figure out how long after high school graduation did they wait until they attended a post-secondary institution.


nsc.enrollment.first.attend <- nsc.enrollment.ipeds %>%
  mutate(first.year = as.integer(str_sub(EnrollmentBeginTimeID, 1,4)),
         first.month = as.integer(str_sub(EnrollmentBeginTimeID, 5,6)),
         first.day = as.integer(str_sub(EnrollmentBeginTimeID, 7, 8))) %>%
  group_by(PersonID) %>%
  filter(first.year == min(first.year)) %>%
  mutate(n = n()) %>%
  ungroup() %>%
  group_by(PersonID) %>%
  filter(first.month == min(first.month)) %>%
  ungroup() %>%
  group_by(PersonID) %>%
  filter(first.day == min(first.day)) %>%
  ungroup() %>%
  distinct(PersonID, .keep_all = TRUE) %>%
  mutate(first.month = str_pad(first.month, width = 2, side = "left", pad = "0"),
         first.day = str_pad(first.day, width = 2, side = "left", pad = "0"),
         first.attend.ps = paste(first.year, first.month, first.day, sep = "")) %>%
  select(PersonID, first.attend.ps)

kable(head(nsc.enrollment.first.attend))
PersonID first.attend.ps
3090532 20090824
6790768 20080825
9172836 20060906
1597967 20060905
5583982 20100823
5456903 20230109
kable(names(nsc.enrollment.first.attend))
x
PersonID
first.attend.ps


This dataset has 397,372 rows and 2 columns. Essentially, this dataset provides each unique PersonID with the earliest begin time for post-secondary education.

Next we will determine how many different institiutions the PersonID attended during their post-secondary career.


nsc.enrollment.n.institutions <- nsc.enrollment.ipeds %>%
  group_by(PersonID) %>%
  distinct(OPEID.6, .keep_all = TRUE) %>%
  ungroup() %>%
  group_by(PersonID) %>%
  summarize(n.institutions = n()) %>%
  ungroup()

kable(head(nsc.enrollment.n.institutions))
PersonID n.institutions
14 1
98 1
141 1
198 2
208 1
227 2
kable(names(nsc.enrollment.n.institutions))
x
PersonID
n.institutions


As expected, we have 397,372 rows and 2 columns. Each PersonID in the dataset has the number of unique post-secondary institutions they attended.

Next, we will create a dataset indicating the type of college the PersonID attended first. I will use the IPEDS sector data. Here are the definitions of the institution sector;

  • 0 - Administrative Unit
  • 1 - Public, 4-year or above
  • 2 - Private not-for-profit, 4-year or above
  • 3 - Private for-profit, 4-year or above
  • 4 - Public, 2-year
  • 5 - Private not-for-profit, 2-year
  • 6 - Private for-profit, 2-year
  • 7 - Public, less-than 2-year
  • 8 - Private not-for-profit, less-than 2-year
  • 9 - Private for-profit, less-than 2-year
  • 99 - Sector unknown (not active)


nsc.enrollment.first.institution <- nsc.enrollment.ipeds %>%
  mutate(EnrollmentBeginTimeID = ymd(EnrollmentBeginTimeID)) %>%
  group_by(PersonID) %>%
  filter(EnrollmentBeginTimeID == min(EnrollmentBeginTimeID)) %>%
  ungroup() %>%
  distinct(PersonID, .keep_all = TRUE) %>%
  select(PersonID, InstitutionSector) %>%
  rename(first.InstitutionSector = InstitutionSector)

kable(head(nsc.enrollment.first.institution))
PersonID first.InstitutionSector
3090532 2
6790768 4
9172836 2
1597967 1
5583982 4
5456903 4
kable(names(nsc.enrollment.first.institution))
x
PersonID
first.InstitutionSector


As expected, we have 397,372 rows and 2 columns. This dataset now provides the sector of the first post-secondary institution they attended immediately after college.

Next we will determine thy type(s) of college(s) they attended during their post-secondary career. To do this we will first create a dataset with columns for each institution sector and a confirmation indicator on whether the PersonID attended that particular sector at some point in their career. I will then create a new category indicating that attended more than one type of institution sector. Once completed, I will be left with a dataset that has a column for each unique PersonID and what sector they attended, as well as a newly created code for “attended more than 1 type of sector”.


grad.list <- read_csv("Data/SLEDS/Masters/Master-8.csv") %>%
  select(PersonID)

nsc.enrollment.institution.types <- nsc.enrollment.ipeds %>%
  distinct(PersonID, InstitutionSector) %>%
  mutate(attended = 1) %>%
  complete(PersonID, InstitutionSector, fill = list(attended = 0)) %>%
  mutate(label = paste("ps.sector.", InstitutionSector, sep = "")) %>%
  select(PersonID, label, attended) %>%
  spread(label, attended) %>%
  right_join(grad.list, by = "PersonID") %>%
  gather(key = "key", value = "value", 2:10) %>%
  mutate(value = ifelse(is.na(value), 0, value)) %>%
  spread(key, value)

kable(head(nsc.enrollment.institution.types))
PersonID ps.sector.1 ps.sector.2 ps.sector.3 ps.sector.4 ps.sector.5 ps.sector.6 ps.sector.7 ps.sector.8 ps.sector.9
55 0 0 0 0 0 0 0 0 0
181 0 0 0 0 0 0 0 0 0
280 1 0 0 0 0 0 0 0 0
301 1 0 0 0 0 0 0 0 0
304 0 0 0 0 0 0 0 0 0
412 1 0 0 0 0 0 0 0 0
kable(names(nsc.enrollment.institution.types))
x
PersonID
ps.sector.1
ps.sector.2
ps.sector.3
ps.sector.4
ps.sector.5
ps.sector.6
ps.sector.7
ps.sector.8
ps.sector.9


As expected, we have 122,938 rows and 10 columns. This dataset provides each unique PersonID with the institution sector they attended using the codes listed above. For PersonID’s that attended multiple sectors, they were coded as “10”.

Next we will determine whether they attended a post-secondary institution inside or outside of the planning region, outside their EDR, or outside of Minnesota. Since many of the PersonID in the dataset have attended multiple institutions, we will categorize it in the following way in order to capture the combinations of attendance;

  • Attended inside region only (planning region, EDR, RUCA, State)
  • Attended inside and outside region (planning region, EDR, RUCA, state)
  • Attended outside region only (planning region, EDR, RUCA, state)

In order to do this we will need to combine our planning region and EDR joining documents with the nsc.enrollment.ipeds dataset. I need to remember that the way I arranged the regions are going to be different than the official planning region. We will also need to join the master dataset with it to determine the location of the PersonID’s high school graduation location. Lastly, we will need to join up the RUCA categories for counties outside of Minnesota. Then we can start the categorization process.


dem.desc.county <- read_csv("Data/Join docs/Master-ruca-county.csv")

grad.location <- read_csv("Data/SLEDS/Masters/Master-8.csv") %>%
  select(PersonID, edr, Dem_Desc) %>%
  rename(grad.edr = 2,
         grad.ruca = 3) %>%
  mutate(grad.pr = "Southern")

nsc.enrollment.location.join <- nsc.enrollment.ipeds %>%
  select(PersonID, InstitutionName, CountyCode) %>%
  left_join(dem.desc.county, by = c("CountyCode" = "State-County FIPS Code")) %>%
  left_join(grad.location, by = "PersonID") %>%
  drop_na(grad.ruca) %>%
  mutate(countyfp = str_sub(CountyCode, 3, 5),
         statefp = str_sub(CountyCode, 1, 2)) %>%
  left_join(counties.regions.1[,c(1,5,10,9)], by = c("countyfp", "statefp")) %>%
  rename(ps.Dem_Desc = 4,
         ps.countyfp = 8,
         ps.statefp = 9,
         ps.edr = 10,
         ps.pr = 11) 

kable(head(nsc.enrollment.location.join))
PersonID InstitutionName CountyCode ps.Dem_Desc grad.edr grad.ruca grad.pr ps.countyfp ps.statefp ps.edr ps.pr
10025234 Hamline University 27123 Urban/town/rural mix EDR 10 - Southeast Entirely urban Southern 123 27 EDR 11 - 7 County Twin Cities Seven County Mpls-St Paul
3090532 Crown College 27019 Entirely urban EDR 10 - Southeast Entirely urban Southern 019 27 EDR 11 - 7 County Twin Cities Seven County Mpls-St Paul
5943119 University of Minnesota-Twin Cities 27053 Entirely urban EDR 10 - Southeast Urban/town/rural mix Southern 053 27 EDR 11 - 7 County Twin Cities Seven County Mpls-St Paul
7693352 Azusa Pacific University 06037 Urban/town/rural mix EDR 10 - Southeast Town/rural mix Southern 037 06 NA NA
289664 Northwestern College 19167 Town/rural mix EDR 9 - South Central Urban/town/rural mix Southern 167 19 NA NA
9116706 Southwest Minnesota State University 27083 Town/rural mix EDR 10 - Southeast Urban/town/rural mix Southern 083 27 EDR 8 - Southwest Southwest
kable(names(nsc.enrollment.location.join))
x
PersonID
InstitutionName
CountyCode
ps.Dem_Desc
grad.edr
grad.ruca
grad.pr
ps.countyfp
ps.statefp
ps.edr
ps.pr


This joined dataset gives us 694,684 rows and 11 columns. The columns beginning with “ps” are the ruca category and regions of the post-secondary institution attended. The columns beginning with “grad” are the ruca category and regions of the high school from which the PersonID graduated.

One thing that’s important is to realize that joining the nsc.enrollment dataset with the Southern graduates dataset means we will have some NAs. Some of the students listed in the nsc.enrollment dataset aren’t in the Southern graduates dataset since they may have graduated before 2008 or didn’t actually meet the criteria in the Southern graduates dataset. Therefore, as we move forward, I will need to make sure to remove those NAs.

From here we can start creating new columns beginning with the RUCA category. to create this column we will gather each PersonID to examine whether or they attended a post-secondary institution in the same RUCA category, or if they attended multiple post-secondary institutions with one institution in the same category and another not the same.


nsc.enrollment.same.ruca <- nsc.enrollment.location.join %>% 
  mutate(ps.in.same.ruca = ifelse(ps.Dem_Desc == grad.ruca, "In same RUCA", "Outside RUCA")) %>%
  select(PersonID, ps.Dem_Desc, grad.ruca, ps.in.same.ruca) %>%
  select(PersonID, ps.in.same.ruca) %>%
  distinct(PersonID, ps.in.same.ruca) %>%
  mutate(code = ifelse(ps.in.same.ruca == "In same RUCA", 1, 2)) %>%
  group_by(PersonID) %>%
  summarize(code = sum(code)) %>%
  ungroup() %>%
  mutate(ps.in.same.ruca = ifelse(code == 1, "In same RUCA",
                                  ifelse(code == 2, "Outside RUCA", "Inside and outside same RUCA"))) %>%
  select(PersonID, ps.in.same.ruca)

kable(head(nsc.enrollment.same.ruca))
PersonID ps.in.same.ruca
280 In same RUCA
301 Outside RUCA
412 Outside RUCA
513 Outside RUCA
687 Outside RUCA
710 In same RUCA
kable(names(nsc.enrollment.same.ruca))
x
PersonID
ps.in.same.ruca


So there are fewer observations in this dataset than previous subsets. Why? This is due to dropping a PersonID that wasn’t in the Southern graduate dataset. I did not do that for the previous subsets. However, those previous subsets will be filtered down once we join it with the master dataset.

This dataset provides the PersonID and whether they attended a post secondary institution in a location with the same, outside, or both outside and inside (if attended multiple post-secondary institutions) RUCA categories of their high school from which they graduated. There are 90,596 rows and 2 columns.

Up next we will create a dataset indicating whether a PersonID that graduated from a Southern MN high school attended a post-secondary institution in their high school’s EDR.


nsc.enrollment.same.edr <- nsc.enrollment.location.join %>%
  mutate(ps.in.same.edr = ifelse(ps.edr == grad.edr, "In same EDR", "Outside EDR"),
         ps.in.same.edr = ifelse(is.na(ps.in.same.edr), "Outside EDR", ps.in.same.edr)) %>%
  select(PersonID, ps.edr, grad.edr, ps.in.same.edr) %>%
  select(PersonID, ps.in.same.edr) %>%
  distinct(PersonID, ps.in.same.edr) %>%
  mutate(code = ifelse(ps.in.same.edr == "In same EDR", 1, 2)) %>%
  group_by(PersonID) %>%
  summarize(code = sum(code)) %>%
  ungroup() %>%
  mutate(ps.in.same.edr = ifelse(code == 1, "In same EDR",
                                  ifelse(code == 2, "Outside EDR", "Inside and outside same EDR"))) %>%
  select(PersonID, ps.in.same.edr)

kable(head(nsc.enrollment.same.edr))
PersonID ps.in.same.edr
280 Outside EDR
301 Outside EDR
412 Outside EDR
513 In same EDR
687 Outside EDR
710 Outside EDR
kable(names(nsc.enrollment.same.edr))
x
PersonID
ps.in.same.edr


As expected there are 90,596 rows and 2 columns.

Next we will determine which Southern MN graduates attended a post-secondary institution in the same project region (Southern project region).


nsc.enrollment.same.pr <- nsc.enrollment.location.join %>%
  mutate(ps.in.same.pr = ifelse(ps.pr == grad.pr, "In same PR", "Outside PR"),
         ps.in.same.pr = ifelse(is.na(ps.in.same.pr), "Outside PR", ps.in.same.pr)) %>%
  select(PersonID, ps.pr, grad.pr, ps.in.same.pr) %>%
  select(PersonID, ps.in.same.pr) %>%
  distinct(PersonID, ps.in.same.pr) %>%
  mutate(code = ifelse(ps.in.same.pr == "In same PR", 1, 2)) %>%
  group_by(PersonID) %>%
  summarize(code = sum(code)) %>%
  ungroup() %>%
  mutate(ps.in.same.pr = ifelse(code == 1, "In same PR",
                                  ifelse(code == 2, "Outside PR", "Inside and outside same PR"))) %>%
  select(PersonID, ps.in.same.pr)

kable(head(nsc.enrollment.same.pr))
PersonID ps.in.same.pr
280 Outside PR
301 Outside PR
412 In same PR
513 In same PR
687 Outside PR
710 Outside PR
kable(names(nsc.enrollment.same.pr))
x
PersonID
ps.in.same.pr


As expected, there are 90,596 rows and 2 columns. This dataset provides whether the post-secondary institution(s) attended were in the same planning region as the high school from which they graduates, outside of the planning region, or both (attended multiple institutions).

Next we want to see how many of the students leave the state to attend post-secondary education.


nsc.enrollment.in.MN <- nsc.enrollment.location.join %>%
  mutate(ps.in.MN = ifelse(ps.statefp == "27", "In MN", "Outside MN")) %>%
  select(PersonID, ps.in.MN) %>%
  distinct(PersonID, ps.in.MN) %>%
  mutate(code = ifelse(ps.in.MN == "In MN", 1, 2)) %>%
  group_by(PersonID) %>%
  summarize(code = sum(code)) %>%
  ungroup() %>%
  mutate(ps.in.MN = ifelse(code == 1, "In MN",
                                  ifelse(code == 2, "Outside MN", "Inside and outside MN"))) %>%
  select(PersonID, ps.in.MN)

kable(head(nsc.enrollment.in.MN))
PersonID ps.in.MN
280 In MN
301 Outside MN
412 In MN
513 In MN
687 Outside MN
710 Outside MN
kable(names(nsc.enrollment.in.MN))
x
PersonID
ps.in.MN


As expected we have 90,596 rows and 2 columns. This dataset provides each distinct PersonID with whether they attended post secondary institutions inside MN, outside MN, or both.

Okay, now it’s time to join all of these with the master dataset. I will also create a new column that confirms whether they attended a college immediately (within the first year) after graduating high school.


master <- read_csv("Data/SLEDS/Masters/Master-8.csv") %>%
  mutate(PersonID = as.numeric(PersonID)) 

master.9 <- master %>%
  left_join(nsc.enrollment.confirm, by = "PersonID") %>%
  mutate(attended.ps = ifelse(is.na(attended.ps), "No", attended.ps)) %>%
  left_join(nsc.enrollment.first.attend, by = "PersonID") %>%
  mutate(attended.ps.first.year = as.integer(str_sub(first.attend.ps, 1, 4)),
         attended.ps.years.hsgrad = attended.ps.first.year - grad.year,
         attended.ps.within.first.year.hsgrad = ifelse(attended.ps.years.hsgrad < 2, "Yes", "No")) %>%
  left_join(nsc.enrollment.n.institutions, by = "PersonID") %>%
  left_join(nsc.enrollment.first.institution, by = "PersonID") %>%
  left_join(nsc.enrollment.institution.types, by = "PersonID") %>%
  left_join(nsc.enrollment.same.ruca, by = "PersonID") %>%
  left_join(nsc.enrollment.same.edr, by = "PersonID") %>%
  left_join(nsc.enrollment.same.pr, by = "PersonID") %>%
  left_join(nsc.enrollment.in.MN, by = "PersonID") %>%
  mutate(n.institutions = ifelse(is.na(n.institutions), 0, n.institutions),
         attended.ps.years.hsgrad = ifelse(attended.ps.years.hsgrad == -14, -1, attended.ps.years.hsgrad),
         Dem_Desc = fct_relevel(Dem_Desc, "Town/rural mix", "Urban/town/rural mix", "Entirely urban"))

nscenrollment.master <- read_csv("Data/SLEDS/Masters/Master-1.csv") %>%
  select(PersonID, grad.year) %>%
  left_join(nsc.enrollment.confirm, by = "PersonID") %>%
  mutate(attended.ps = ifelse(is.na(attended.ps), "No", attended.ps)) %>%
  left_join(nsc.enrollment.first.attend, by = "PersonID") %>%
  mutate(attended.ps.first.year = as.integer(str_sub(first.attend.ps, 1, 4)),
         attended.ps.years.hsgrad = attended.ps.first.year - grad.year,
         attended.ps.within.first.year.hsgrad = ifelse(attended.ps.years.hsgrad < 2, "Yes", "No")) %>%
  left_join(nsc.enrollment.n.institutions, by = "PersonID") %>%
  left_join(nsc.enrollment.first.institution, by = "PersonID") %>%
  left_join(nsc.enrollment.institution.types, by = "PersonID") %>%
  left_join(nsc.enrollment.same.ruca, by = "PersonID") %>%
  left_join(nsc.enrollment.same.edr, by = "PersonID") %>%
  left_join(nsc.enrollment.same.pr, by = "PersonID") %>%
  left_join(nsc.enrollment.in.MN, by = "PersonID") %>%
  mutate(n.institutions = ifelse(is.na(n.institutions), 0, n.institutions)) %>%
  select(-grad.year) 

write_csv(nscenrollment.master, "Data/SLEDS/Masters/nscenrollment_master.csv")

kable(head(master.9))
PersonID K12OrganizationID OrganizationName county.name countyfp Dem_Desc edr grad.year grad.year.covid Gender LimitedEnglishProficiencyIndicator HomelessIndicator economic.status pseo.participant SpecialEdStatus non.english.home RaceEthnicity n.years.attended ACTCompositeScore ap.exam total.cte.courses.taken ACTE-SPED Agriculture, Food, and Natural Resources Architecture & Construction Arts, A/V Technology & Communication Business, Management & Administrative Diversified Education & Training Finance Government & Public Administration Health Science Hospitality & Tourism Human Services Information Technology Law, Public Safety & Security Manufacturing Marketing Program Area (Diversified) Science, Technology, Engineering, & Mathematics STEM Transportation, Distribution & Logistics Work Experience-Handicapped Work Experience-Handicapped (16-20+ on IEP) Work Experience/Career Exploration (age 14-15 on I Work-Experience-Disadvantaged Youth Apprenticeship cte.achievement english.learner MCA.M MCA.R MCA.S sat.taken attended.ps first.attend.ps attended.ps.first.year attended.ps.years.hsgrad attended.ps.within.first.year.hsgrad n.institutions first.InstitutionSector ps.sector.1 ps.sector.2 ps.sector.3 ps.sector.4 ps.sector.5 ps.sector.6 ps.sector.7 ps.sector.8 ps.sector.9 ps.in.same.ruca ps.in.same.edr ps.in.same.pr ps.in.MN
7699638 29705 FILLMORE CENTRAL SENIOR HIGH Fillmore 045 Urban/town/rural mix EDR 10 - Southeast 2015 Pre-covid grad M N N 0 1 0 0 White 3 19 0 2 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 CTE Participant 0 3 4 3 0 Yes 20150824 2015 0 Yes 3 1 1 0 0 1 0 0 0 0 0 Inside and outside same RUCA Inside and outside same EDR Inside and outside same PR Inside and outside MN
2225025 132298 MAYO SENIOR HIGH Olmsted 109 Entirely urban EDR 10 - Southeast 2017 Pre-covid grad M N N 0 0 0 0 White 3 29 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 No CTE 0 4 4 4 1 Yes 20170905 2017 0 Yes 1 1 1 0 0 0 0 0 0 0 0 In same RUCA Outside EDR Outside PR In MN
10400093 62 Cannon Falls Secondary Goodhue 049 Urban/town/rural mix EDR 10 - Southeast 2014 Pre-covid grad F N N 0 1 0 0 White 3 20 1 2 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 CTE Participant 0 4 4 3 0 Yes 20140825 2014 0 Yes 2 4 0 1 0 1 0 0 0 0 0 In same RUCA Outside EDR Outside PR In MN
8786746 128661 LINCOLN SECONDARY Wabasha 157 Urban/town/rural mix EDR 10 - Southeast 2014 Pre-covid grad F N N 0 1 0 0 White 3 24 1 2 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 CTE Participant 0 4 4 3 0 Yes 20140902 2014 0 Yes 1 2 0 1 0 0 0 0 0 0 0 In same RUCA Outside EDR In same PR In MN
11122574 109999 AUSTIN SENIOR HIGH Mower 099 Urban/town/rural mix EDR 10 - Southeast 2011 Pre-covid grad F N N 0 1 0 0 White 3 19 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 No CTE 0 3 4 3 0 Yes 20110822 2011 0 Yes 2 4 1 0 0 1 0 0 0 0 0 In same RUCA Inside and outside same EDR Inside and outside same PR In MN
10181573 155484 CHATFIELD SECONDARY Olmsted 109 Entirely urban EDR 10 - Southeast 2010 Pre-covid grad M N N 0 NA 0 0 White 3 33 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 CTE Participant 0 4 4 4 0 Yes 20120827 2012 2 No 2 4 1 0 0 1 0 0 0 0 0 In same RUCA Outside EDR Outside PR In MN
kable(names(master.9))
x
PersonID
K12OrganizationID
OrganizationName
county.name
countyfp
Dem_Desc
edr
grad.year
grad.year.covid
Gender
LimitedEnglishProficiencyIndicator
HomelessIndicator
economic.status
pseo.participant
SpecialEdStatus
non.english.home
RaceEthnicity
n.years.attended
ACTCompositeScore
ap.exam
total.cte.courses.taken
ACTE-SPED
Agriculture, Food, and Natural Resources
Architecture & Construction
Arts, A/V Technology & Communication
Business, Management & Administrative
Diversified
Education & Training
Finance
Government & Public Administration
Health Science
Hospitality & Tourism
Human Services
Information Technology
Law, Public Safety & Security
Manufacturing
Marketing
Program Area (Diversified)
Science, Technology, Engineering, & Mathematics
STEM
Transportation, Distribution & Logistics
Work Experience-Handicapped
Work Experience-Handicapped (16-20+ on IEP)
Work Experience/Career Exploration (age 14-15 on I
Work-Experience-Disadvantaged
Youth Apprenticeship
cte.achievement
english.learner
MCA.M
MCA.R
MCA.S
sat.taken
attended.ps
first.attend.ps
attended.ps.first.year
attended.ps.years.hsgrad
attended.ps.within.first.year.hsgrad
n.institutions
first.InstitutionSector
ps.sector.1
ps.sector.2
ps.sector.3
ps.sector.4
ps.sector.5
ps.sector.6
ps.sector.7
ps.sector.8
ps.sector.9
ps.in.same.ruca
ps.in.same.edr
ps.in.same.pr
ps.in.MN


As expected, we have the same number of rows as the original master dataset - 122,938 rows. Here are the explanations of all the new columns added to the master dataset.

  • attended.ps - this is a confirmation that the PersonID attended a post-secondary institution at some point after graduating high school.
  • attended.ps.years.hsgrad - the number of years after graduation that a PersonID attended a post-secondary education. In some cases, the value is negative indicating that the PersonID attended a post-secondary institution before graduating high school.
  • attended.ps.within.first.year.hsgrad - did the PersonID attend a post-secondary education institution within 1 year or less from graduating high school.
  • n.institutions - how many post-secondary institutions the PersonID attended
  • first.InstitutionSector - the sector of the first post-secondary institution attended.
  • InstitutionSector - The sector of the post-secondary institution attended (could be multiple)
  • ps.in.same.ruca - did the PersonID attend a post-secondary institution in the same RUCA category as their high school
  • ps.in.same.edr - did the PersonID attend a post-secondary institution in the same economic development region as high school
  • ps.in.same.pr - did the PersonID attend a post-secondary institution in the same planning region as high school
  • ps.in.MN = did the PersonID attend a post-secondary institution in Minnesota.


Summary of students attending post-secondary

Lets summarize the percentage of students that attended a post-secondary institution and compare across RUCA categories.

Below is the percentage of students that attended a post-secondary institution from the entire dataset. 72% of the PersonIDs in the dataset attended a post-secondary institution at some point between 2007 and 2024.


attending.ps.pct.total <- master.9 %>%
  group_by(attended.ps) %>%
  summarize(n = n()) %>%
  ungroup() %>%
  mutate(pct = n / sum(n)) 

datatable(attending.ps.pct.total, class = "cell-border stripe", filter = "top", rownames = FALSE,
          options = list(columnDefs = list(list(className = "dt-center", targets = 1:2)))) %>%
  formatCurrency(2, "", digits = 0) %>%
  formatPercentage(3)


Now lets check to see if this percentage is statistically significantly different by RUCA group.

The crosstabs do indicate that there is a statistically significant difference in the percentage of graduates that attend a post-secondary institution by high school RUCA category. Entirely urban districts have the highest percentage of graduates attend a post-secondary institution - 75% compared to 70% for the other RUCA categories.


attending.ps.pct.ruca <- master.9 %>%
  group_by(Dem_Desc, attended.ps) %>%
  summarize(n = n()) %>%
  ungroup() %>%
  group_by(Dem_Desc) %>%
  mutate(pct = n / sum(n)) %>%
  ungroup() %>%
  mutate(data_id = as.character(seq(n())))

attending.ps.pct.ruca.plot <- ggplot(attending.ps.pct.ruca, aes(Dem_Desc, pct, fill = attended.ps, group = attended.ps)) +
    geom_col_interactive(position = "dodge", aes(data_id = data_id, tooltip = paste(Dem_Desc, "\nAttended a post-secondary intitution: ", attended.ps, "\nNumber: ", comma(n, accuracy = 1), "\nPercent of students: ", percent(pct, accuracy = .1), sep = ""))) +
    geom_label(aes(label = percent(pct, accuracy = .1),), position = position_dodge(width = .9), show.legend = FALSE, color = "white", size = 5) +
    labs(x="", y = "", color="", title = "Attended a post-secondary institution")+
    scale_y_continuous(labels=scales::percent)+
    scale_x_discrete(guide = guide_axis(n.dodge = 1)) +
    theme_bar+
    scale_fill_manual(values = brewer.pal(n = 5, "RdYlBu"),
                       guide = guide_legend(ncol = 3)) +
    theme(legend.position = "bottom",
          text = element_text(size = 16))


girafe(ggobj = attending.ps.pct.ruca.plot) %>%
  girafe_options(opts_selection(type = "none"),
                 opts_sizing(rescale = FALSE))
CrossTable(master.9$Dem_Desc, master.9$attended.ps, expected = TRUE, prop.t = FALSE, prop.c = FALSE, prop.chisq = FALSE)

 
   Cell Contents
|-------------------------|
|                       N |
|              Expected N |
|           N / Row Total |
|-------------------------|

 
Total Observations in Table:  122938 

 
                     | master.9$attended.ps 
   master.9$Dem_Desc |        No |       Yes | Row Total | 
---------------------|-----------|-----------|-----------|
      Town/rural mix |      6322 |     15801 |     22123 | 
                     |  6129.722 | 15993.278 |           | 
                     |     0.286 |     0.714 |     0.180 | 
---------------------|-----------|-----------|-----------|
Urban/town/rural mix |     17175 |     41824 |     58999 | 
                     | 16347.126 | 42651.874 |           | 
                     |     0.291 |     0.709 |     0.480 | 
---------------------|-----------|-----------|-----------|
      Entirely urban |     10566 |     31250 |     41816 | 
                     | 11586.152 | 30229.848 |           | 
                     |     0.253 |     0.747 |     0.340 | 
---------------------|-----------|-----------|-----------|
        Column Total |     34063 |     88875 |    122938 | 
---------------------|-----------|-----------|-----------|

 
Statistics for All Table Factors


Pearson's Chi-squared test 
------------------------------------------------------------
Chi^2 =  190.5888     d.f. =  2     p =  4.113063e-42 


 


Summary of years between graduation and attending post-secondary

Now lets take a look at the number of years between high school graduation and attending post-secondary. We will begin with the total number of students before diving into differences across RUCA categories and regions.

A huge majority of students that attended a post secondary institution waited less than 1 year after graduating high school (87%)


years.between.grad.ps.total <- master.9 %>%
  filter(attended.ps == "Yes") %>%
  select(PersonID, first.attend.ps, attended.ps.first.year, attended.ps.within.first.year.hsgrad, attended.ps.years.hsgrad, grad.year) %>%
  summarize(`Number attended post-secondary` = comma(n()),
            `Mean years` = mean(attended.ps.years.hsgrad),
            `Median years` = median(attended.ps.years.hsgrad),
            `Max years` = max(attended.ps.years.hsgrad),
            `Min years` = min(attended.ps.years.hsgrad),
            `Standard deviation` = sd(attended.ps.years.hsgrad)) 

datatable(years.between.grad.ps.total, class = "cell-border stripe", filter = "top", rownames = FALSE,
          options = list(columnDefs = list(list(className = "dt-center", targets = 0:5)))) %>%
  formatCurrency(2, "", digits = 3) %>%
  formatCurrency(6, "", digits = 4)
years.between.grad.ps.dist.total <- master.9 %>%
  filter(attended.ps == "Yes") %>%
  mutate(attended.ps.years.hsgrad.bins = cut(attended.ps.years.hsgrad,
                                             breaks = c(-1.1, .1, 1, 2, 3, 100),
                                             labels = c("Less than one year", "1 year", "2 years", "3 years", "4 years or more")),
         data_id = as.character(seq(n()))) %>%
  group_by(attended.ps.years.hsgrad.bins) %>%
  summarize(n = n()) %>%
  ungroup() %>%
  mutate(pct = n / sum(n),
         data_id = as.character(seq(n())))
  

years.between.grad.ps.dist.total.plot <- ggplot(years.between.grad.ps.dist.total, aes(attended.ps.years.hsgrad.bins, pct)) +
    geom_col_interactive(aes(data_id = data_id, tooltip = paste("\nNumber of years between high school graduation and attending post secondary: ", attended.ps.years.hsgrad.bins, "\nNumber of students: ", comma(n, accuracy = 1), "\nPercent of students: ", percent(pct, accuracy = .1), sep = ""))) +
    geom_label(aes(label = percent(pct, accuracy = .1)), show.legend = FALSE, color =  "black", size = 5) +
    labs(x="", y = "", color="", title = "Percent of students and the number of years between high school and attending post secondary")+
    scale_y_continuous(labels=scales::percent)+
    scale_x_discrete(guide = guide_axis(n.dodge = 1)) +
    theme_bar+
  theme(legend.position = "none",
          text = element_text(size = 16),
        axis.text.x = element_text(angle = 25, hjust = .9))

girafe(ggobj = years.between.grad.ps.dist.total.plot) %>%
  girafe_options(opts_selection(type = "none"),
                 opts_sizing(rescale = FALSE))


Summary of number of colleges attended

Next we will summarize the number of colleges attended by producing the summary statistics and distribution.

The table and distribution chart below show that a large majority (60%) of individuals attended only one post secondary institutions followed by 27.4% attending two.


n.colleges.total <- master.9 %>%
  filter(attended.ps == "Yes") %>%
  summarize(`Number` = n(),
            `Mean` = mean(n.institutions),
            `Median` = median(n.institutions),
            `Max` = max(n.institutions),
            `Min` = min(n.institutions),
            `Standard deviation` = sd(n.institutions))

datatable(n.colleges.total, class = "cell-border stripe", filter = "top", rownames = FALSE,
          options = list(columnDefs = list(list(className = "dt-center", targets = 1:5)))) %>%
  formatCurrency(2, "", digits = 2) %>%
  formatCurrency(6, "", digits = 2)
n.colleges.dist.total <- master.9 %>%
  filter(attended.ps == "Yes") %>%
  mutate(n.institutions.bins = cut(n.institutions,
                               breaks = c(0, 1, 2, 3, 4, 20),
                               labels = c("One", "Two", "Three", "Four", "More than 4"))) %>%
  group_by(n.institutions) %>%
  summarise(n = n()) %>%
  ungroup() %>%
  mutate(pct = n / sum(n),
         data_id = as.character(seq(n())))


n.colleges.dist.total.plot <- ggplot(n.colleges.dist.total, aes(n.institutions, pct)) +
  geom_col_interactive(aes(data_id = data_id, tooltip = paste("Number of institutions attended: ", comma(n.institutions, accuracy = 1), "\nNumber of students: ", comma(n, accuracy = 1), "\nPercent of students: ", percent(pct, accuracy = .1), sep = ""))) +
  geom_label(aes(label = percent(pct, accuracy = .1)), show.legend = FALSE, color = "black", size = 5) +
  labs(x="", y = "", color="", title = "Percent of students by number of post secondary institutions attended")+
    scale_y_continuous(labels=scales::percent)+
    theme_bar+
    theme(legend.position = "none",
          text = element_text(size = 16))


girafe(ggobj = n.colleges.dist.total.plot) %>%
  girafe_options(opts_selection(type = "none"),
                 opts_sizing(rescale = FALSE))


I’m not sure it’s all that important knowing whether there are differences in the percentage of students and the number of institutions they’ve attended across RUCA categories or regions. So we will skip that analysis for now.


Summary of types of first college attended

I think this will be very interesting - we want to see what they breakdown is of students attending different types of colleges. We will start by summarizing the total dataset before looking at differences across RUCA categories and regions.

The chart below provides the percentage of students in the entire dataset that attended each institution sector. By far, a huge majority attend either a public 4-year (44%) or public 2-year (38%).


sectors <- read_xlsx('/Users/kellyasche/Library/CloudStorage/GoogleDrive-kasche@ruralmn.org/My Drive/Research/FY25-27/FY26/OHE Journey to meaningful employment/Data/SLEDS/NSC/IPEDS/institution-sectors.xlsx')

college.type.summary.total <- master.9 %>%
  filter(attended.ps == "Yes") %>%
  group_by(first.InstitutionSector) %>%
  summarize(n = n()) %>%
  ungroup() %>%
  mutate(pct = n / sum(n)) %>%
  left_join(sectors, by = c("first.InstitutionSector" = "Code")) %>%
  mutate(data_id = as.character(seq(n())))

college.type.summary.total.plot <- ggplot(college.type.summary.total, aes(x = fct_rev(fct_reorder(Sector, pct)), pct)) +
    geom_col_interactive(aes(data_id = data_id, tooltip = paste("College sector: ", Sector, "\nNumber of students in college sector: ", comma(n, accuracy = 1), "\nPercent of students in college sector: ", percent(pct, accuracy = .1), sep = ""))) +
    geom_label(aes(label = percent(pct, accuracy = .1)), show.legend = FALSE, color = "black", size = 5) +
    labs(x="", y = "", color="", title = "Percent of students first college sector")+
    scale_y_continuous(labels=scales::percent)+
    scale_x_discrete(guide = guide_axis(n.dodge = 1)) +
    theme_bar+
    theme(legend.position = "none",
          text = element_text(size = 18),
          axis.text.x = element_text(angle = 45, hjust = 1))


girafe(ggobj = college.type.summary.total.plot, width_svg = 10, height_svg = 6) %>%
  girafe_options(opts_selection(type = "none"),
                 opts_sizing(rescale = FALSE))


Next we will check to see if those percentages are significantly different by RUCA category of the graduates high school.

The crosstabs indicate that there is a significant difference in the percentage of students attending different institution sectors depending on the RUCA category of their high school. The p-value was near zero.

The primary difference is the graduates from town/rural mix districts attend 2-year colleges at a higher percentage and attend public, 4-year colleges at a lower level.


first.college.sector.ruca <- master.9 %>%
  filter(attended.ps == "Yes") %>%
  group_by(Dem_Desc, first.InstitutionSector) %>%
  summarize(n = n()) %>%
  ungroup() %>%
  group_by(Dem_Desc) %>%
  mutate(pct = n / sum(n)) %>%
  ungroup() %>%
  left_join(sectors, by = c("first.InstitutionSector" = "Code")) %>%
  mutate(data_id = as.character(seq(n()))) %>%
  mutate(total.pct = n / sum(n))

first.college.sector.ruca.plot <- ggplot(first.college.sector.ruca, aes(fct_rev(fct_reorder(Sector, pct)), pct, fill = Dem_Desc, group = Dem_Desc)) +
    geom_col_interactive(position = "dodge", aes(data_id = data_id, tooltip = paste(Dem_Desc, "\nInstitution sector: ", Sector, "\nNumber of graduates attending: ", comma(n, accuracy = 1), "\nPercent of students attending: ", percent(pct, accuracy = .1), sep = ""))) +
    geom_label(position = position_dodge(width = .9), aes(label = percent(pct, accuracy = .1)), show.legend = FALSE, color = "black", size = 5) +
    labs(x="", y = "", color="", title = "Percent of students and first college attended\nby RUCA category")+
    scale_y_continuous(labels=scales::percent)+
    theme_bar+
    scale_fill_manual(values= color.ruca,
                       guide = guide_legend(ncol = 3)) +
    theme(legend.position = "bottom",
          text = element_text(size = 16),
          axis.text.x = element_text(angle = 45, hjust = 1))


girafe(ggobj = first.college.sector.ruca.plot) %>%
  girafe_options(opts_selection(type = "none"),
                 opts_sizing(rescale = FALSE))
first.college.sector.ct.1.ruca <- master.9 %>%
  filter(attended.ps == "Yes") %>%
  select(Dem_Desc, first.InstitutionSector) %>%
  left_join(sectors, by = c("first.InstitutionSector" = "Code")) %>%
  filter(Sector %in% c("Public, 4-year or above", "Public, 2-year", "Private not-for-profit, 4-year or above", "Private for-profit, 4-year or above")) %>%
  mutate(Sector = as_factor(Sector))

CrossTable(first.college.sector.ct.1.ruca$Dem_Desc, first.college.sector.ct.1.ruca$first.InstitutionSector, expected = TRUE, prop.t = FALSE, prop.c = FALSE, prop.chisq = FALSE)

 
   Cell Contents
|-------------------------|
|                       N |
|              Expected N |
|           N / Row Total |
|-------------------------|

 
Total Observations in Table:  88752 

 
                                        | first.college.sector.ct.1.ruca$first.InstitutionSector 
first.college.sector.ct.1.ruca$Dem_Desc |         1 |         2 |         3 |         4 | Row Total | 
----------------------------------------|-----------|-----------|-----------|-----------|-----------|
                         Town/rural mix |      6899 |      2456 |       262 |      6162 |     15779 | 
                                        |  6880.022 |  2632.678 |   195.211 |  6071.089 |           | 
                                        |     0.437 |     0.156 |     0.017 |     0.391 |     0.178 | 
----------------------------------------|-----------|-----------|-----------|-----------|-----------|
                   Urban/town/rural mix |     17411 |      7320 |       503 |     16514 |     41748 | 
                                        | 18203.129 |  6965.526 |   516.488 | 16062.857 |           | 
                                        |     0.417 |     0.175 |     0.012 |     0.396 |     0.470 | 
----------------------------------------|-----------|-----------|-----------|-----------|-----------|
                         Entirely urban |     14388 |      5032 |       333 |     11472 |     31225 | 
                                        | 13614.849 |  5209.796 |   386.302 | 12014.054 |           | 
                                        |     0.461 |     0.161 |     0.011 |     0.367 |     0.352 | 
----------------------------------------|-----------|-----------|-----------|-----------|-----------|
                           Column Total |     38698 |     14808 |      1098 |     34148 |     88752 | 
----------------------------------------|-----------|-----------|-----------|-----------|-----------|

 
Statistics for All Table Factors


Pearson's Chi-squared test 
------------------------------------------------------------
Chi^2 =  183.4382     d.f. =  6     p =  6.313396e-37 


 


Next we will check to see if those percentages are significantly different by EDR of the graduates high school.

The crosstabs indicate that there is a significant difference in the percentage of students attending different institution sectors depending on the EDR of their high school. The p-value was near zero.

The biggest difference is that EDR 9 attends 4-year public institution at a much higher rate while attending a 2-year public at a much lower rate. EDR 10 was the opposite.


first.college.sector.edr <- master.9 %>%
  filter(attended.ps == "Yes") %>%
  group_by(edr, first.InstitutionSector) %>%
  summarize(n = n()) %>%
  ungroup() %>%
  group_by(edr) %>%
  mutate(pct = n / sum(n)) %>%
  ungroup() %>%
  left_join(sectors, by = c("first.InstitutionSector" = "Code")) %>%
  mutate(data_id = as.character(seq(n()))) %>%
  mutate(total.pct = n / sum(n))

first.college.sector.edr.plot <- ggplot(first.college.sector.edr, aes(fct_rev(fct_reorder(Sector, pct)), pct, fill = edr, group = edr)) +
    geom_col_interactive(position = "dodge", aes(data_id = data_id, tooltip = paste(edr, "\nInstitution sector: ", Sector, "\nNumber of graduates attending: ", comma(n, accuracy = 1), "\nPercent of students attending: ", percent(pct, accuracy = .1), sep = ""))) +
    geom_label(position = position_dodge(width = .9), aes(label = percent(pct, accuracy = .1)), show.legend = FALSE, color = "black", size = 5) +
    labs(x="", y = "", color="", title = "Percent of students and first college attended\nby edr category")+
    scale_y_continuous(labels=scales::percent)+
    theme_bar+
    scale_fill_manual(values= color.edr,
                       guide = guide_legend(ncol = 3)) +
    theme(legend.position = "bottom",
          text = element_text(size = 16),
          axis.text.x = element_text(angle = 45, hjust = 1))


girafe(ggobj = first.college.sector.edr.plot, height_svg = 10) %>%
  girafe_options(opts_selection(type = "none"),
                 opts_sizing(rescale = FALSE))
first.college.sector.ct.1.edr <- master.9 %>%
  filter(attended.ps == "Yes") %>%
  select(edr, first.InstitutionSector) %>%
  left_join(sectors, by = c("first.InstitutionSector" = "Code")) %>%
  filter(Sector %in% c("Public, 4-year or above", "Public, 2-year", "Private not-for-profit, 4-year or above", "Private for-profit, 4-year or above")) %>%
  mutate(Sector = as_factor(Sector))

CrossTable(first.college.sector.ct.1.edr$edr, first.college.sector.ct.1.edr$first.InstitutionSector, expected = TRUE, prop.t = FALSE, prop.c = FALSE, prop.chisq = FALSE)

 
   Cell Contents
|-------------------------|
|                       N |
|              Expected N |
|           N / Row Total |
|-------------------------|

 
Total Observations in Table:  88752 

 
                                  | first.college.sector.ct.1.edr$first.InstitutionSector 
first.college.sector.ct.1.edr$edr |         1 |         2 |         3 |         4 | Row Total | 
----------------------------------|-----------|-----------|-----------|-----------|-----------|
               EDR 10 - Southeast |     26359 |     10619 |       686 |     25620 |     63284 | 
                                  | 27593.341 | 10558.742 |   782.921 | 24348.995 |           | 
                                  |     0.417 |     0.168 |     0.011 |     0.405 |     0.713 | 
----------------------------------|-----------|-----------|-----------|-----------|-----------|
            EDR 9 - South Central |     12339 |      4189 |       412 |      8528 |     25468 | 
                                  | 11104.659 |  4249.258 |   315.079 |  9799.005 |           | 
                                  |     0.484 |     0.164 |     0.016 |     0.335 |     0.287 | 
----------------------------------|-----------|-----------|-----------|-----------|-----------|
                     Column Total |     38698 |     14808 |      1098 |     34148 |     88752 | 
----------------------------------|-----------|-----------|-----------|-----------|-----------|

 
Statistics for All Table Factors


Pearson's Chi-squared test 
------------------------------------------------------------
Chi^2 =  466.635     d.f. =  3     p =  8.106745e-101 


 


Summary of attending college with same RUCA category

Up next is determining how many students attended a post-secondary institution located in a county with the same RUCA category as their high school. First we will look at the percentages of the total dataset and then we will break it up.

The chart below shows that 37% of students graduating from a Southern MN high school attended a post secondary institution that was located in a county with the same RUCA category as their high school.


ps.same.ruca.total <- master.9 %>%
  filter(attended.ps == "Yes") %>%
  group_by(ps.in.same.ruca) %>%
  summarize(n = n()) %>%
  ungroup() %>%
  mutate(pct = n / sum(n),
         data_id = as.character(seq(n())))

ps.same.ruca.total.plot <- ggplot(ps.same.ruca.total, aes(fct_rev(fct_reorder(ps.in.same.ruca, pct)), pct)) +
    geom_col_interactive(aes(data_id = data_id, tooltip = paste("Location of post-secondary institution: ", ps.in.same.ruca, "\nNumber of students: ", comma(n, accuracy = 1), "\nPercent of students: ", percent(pct, accuracy = .1), sep = ""))) +
    geom_label(aes(label = percent(pct, accuracy = .1)), show.legend = FALSE, color = "black", size = 5) +
    labs(x="", y = "", color="", title = "Percent of students attending post secondary in RUCA category")+
    scale_y_continuous(labels=scales::percent)+
    scale_x_discrete(guide = guide_axis(n.dodge = 1)) +
    theme_bar+
    theme(legend.position = "none",
          text = element_text(size = 18))


girafe(ggobj = ps.same.ruca.total.plot, width_svg = 10, height_svg = 6) %>%
  girafe_options(opts_selection(type = "none"),
                 opts_sizing(rescale = FALSE))


Next, lets check to see if the type of RUCA category they graduated from is related to whether the college they attend is in the same or different RUCA category.

The crosstabs indicate that there is a relationship in the location of the individuals high school graduation and the RUCA category of their post-secondary institution(s). The p-value was 0.

A significantly higher percentage of town/rural mix graduates attended a post-secondary institution outside their RUCA category.


ps.same.ruca.ruca <- master.9 %>%
  filter(attended.ps == "Yes") %>%
  group_by(Dem_Desc, ps.in.same.ruca) %>%
  summarise(n = n()) %>%
  ungroup() %>%
  complete(ps.in.same.ruca, Dem_Desc) %>%
  mutate(n = ifelse(is.na(n), 0, n)) %>%
  group_by(Dem_Desc) %>%
  mutate(pct = n / sum(n)) %>%
  ungroup() %>%
  mutate(data_id = as.character(seq(n())))

ps.same.ruca.ruca.plot <- ggplot(ps.same.ruca.ruca, aes(ps.in.same.ruca, pct, group = Dem_Desc, fill = Dem_Desc)) +
    geom_col_interactive(position = "dodge", aes(data_id = data_id, tooltip = paste("RUCA category comparison between high school and college: ", "\nNumber of students: ", comma(n, accuracy = 1), "\nPercent of students: ", percent(pct, accuracy = .1), sep = ""))) +
    geom_label(position = position_dodge(width = .9), aes(label = percent(pct, accuracy = .1)), show.legend = FALSE, color = "black", size = 5) +
    labs(x="", y = "", color="", title = "Percent of students attending post-secondary of same RUCA\ncategory by RUCA category of high school")+
    scale_y_continuous(labels=scales::percent)+
    theme_bar+
    scale_fill_manual(values= color.ruca,
                       guide = guide_legend(ncol = 3),
                      breaks = c("Entirely urban", "Town/rural mix", "Urban/town/rural mix")) +
    theme(legend.position = "bottom",
          text = element_text(size = 16),
          axis.text.x = element_text(angle = 25, hjust = .9))


girafe(ggobj = ps.same.ruca.ruca.plot) %>%
  girafe_options(opts_selection(type = "none"),
                 opts_sizing(rescale = FALSE))
ps.same.ruca.ct.ruca <- master.9 %>%
  filter(attended.ps == "Yes")

CrossTable(ps.same.ruca.ct.ruca$Dem_Desc, ps.same.ruca.ct.ruca$ps.in.same.ruca, expected = TRUE, prop.t = FALSE, prop.c = FALSE, prop.chisq = FALSE)

 
   Cell Contents
|-------------------------|
|                       N |
|              Expected N |
|           N / Row Total |
|-------------------------|

 
Total Observations in Table:  88868 

 
                              | ps.same.ruca.ct.ruca$ps.in.same.ruca 
ps.same.ruca.ct.ruca$Dem_Desc |                 In same RUCA | Inside and outside same RUCA |                 Outside RUCA |                    Row Total | 
------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|
               Town/rural mix |                         2338 |                         1986 |                        11477 |                        15801 | 
                              |                     5793.712 |                     3565.840 |                     6441.448 |                              | 
                              |                        0.148 |                        0.126 |                        0.726 |                        0.178 | 
------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|
         Urban/town/rural mix |                        18267 |                         9940 |                        13614 |                        41821 | 
                              |                    15334.398 |                     9437.820 |                    17048.782 |                              | 
                              |                        0.437 |                        0.238 |                        0.326 |                        0.471 | 
------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|
               Entirely urban |                        11980 |                         8129 |                        11137 |                        31246 | 
                              |                    11456.890 |                     7051.341 |                    12737.769 |                              | 
                              |                        0.383 |                        0.260 |                        0.356 |                        0.352 | 
------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|
                 Column Total |                        32585 |                        20055 |                        36228 |                        88868 | 
------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|

 
Statistics for All Table Factors


Pearson's Chi-squared test 
------------------------------------------------------------
Chi^2 =  8366.953     d.f. =  4     p =  0 


 


Next, lets check to see if the EDR they graduated from is related to whether the college they attend is in the same or different EDR.

The crosstabs indicate that there is a relationship in the location of the individuals high school graduation and the EDR their post-secondary institution(s). The p-value was 0.

The differences are minimal.


ps.same.edr.edr <- master.9 %>%
  filter(attended.ps == "Yes") %>%
  group_by(edr, ps.in.same.edr) %>%
  summarise(n = n()) %>%
  ungroup() %>%
  complete(ps.in.same.edr, edr) %>%
  mutate(n = ifelse(is.na(n), 0, n)) %>%
  group_by(edr) %>%
  mutate(pct = n / sum(n)) %>%
  ungroup() %>%
  mutate(data_id = as.character(seq(n())))

ps.same.edr.edr.plot <- ggplot(ps.same.edr.edr, aes(ps.in.same.edr, pct, group = edr, fill = edr)) +
    geom_col_interactive(position = "dodge", aes(data_id = data_id, tooltip = paste("EDR comparison between high school and college: ", "\nNumber of students: ", comma(n, accuracy = 1), "\nPercent of students: ", percent(pct, accuracy = .1), sep = ""))) +
    geom_label(position = position_dodge(width = .9), aes(label = percent(pct, accuracy = .1)), show.legend = FALSE, color = "black", size = 5) +
    labs(x="", y = "", color="", title = "Percent of students attending post-secondary of same EDR\nby EDR of high school")+
    scale_y_continuous(labels=scales::percent)+
    theme_bar+
    scale_fill_manual(values= color.edr,
                       guide = guide_legend(ncol = 3)) +
    theme(legend.position = "bottom",
          text = element_text(size = 16),
          axis.text.x = element_text(angle = 25, hjust = .9))


girafe(ggobj = ps.same.edr.edr.plot) %>%
  girafe_options(opts_selection(type = "none"),
                 opts_sizing(rescale = FALSE))
ps.same.ruca.ct.edr <- master.9 %>%
  filter(attended.ps == "Yes")

CrossTable(ps.same.ruca.ct.edr$edr, ps.same.ruca.ct.edr$ps.in.same.edr, expected = TRUE, prop.t = FALSE, prop.c = FALSE, prop.chisq = FALSE)

 
   Cell Contents
|-------------------------|
|                       N |
|              Expected N |
|           N / Row Total |
|-------------------------|

 
Total Observations in Table:  88875 

 
                        | ps.same.ruca.ct.edr$ps.in.same.edr 
ps.same.ruca.ct.edr$edr |                 In same EDR | Inside and outside same EDR |                 Outside EDR |                   Row Total | 
------------------------|-----------------------------|-----------------------------|-----------------------------|-----------------------------|
     EDR 10 - Southeast |                       17127 |                       12207 |                       34036 |                       63370 | 
                        |                   17623.812 |                   12196.274 |                   33549.914 |                             | 
                        |                       0.270 |                       0.193 |                       0.537 |                       0.713 | 
------------------------|-----------------------------|-----------------------------|-----------------------------|-----------------------------|
  EDR 9 - South Central |                        7590 |                        4898 |                       13017 |                       25505 | 
                        |                    7093.188 |                    4908.726 |                   13503.086 |                             | 
                        |                       0.298 |                       0.192 |                       0.510 |                       0.287 | 
------------------------|-----------------------------|-----------------------------|-----------------------------|-----------------------------|
           Column Total |                       24717 |                       17105 |                       47053 |                       88875 | 
------------------------|-----------------------------|-----------------------------|-----------------------------|-----------------------------|

 
Statistics for All Table Factors


Pearson's Chi-squared test 
------------------------------------------------------------
Chi^2 =  73.37579     d.f. =  2     p =  1.165866e-16 


 


Summary of attending college in same planning region as high school

Now we want to see how many students attend a college that is located in the same planning region as their high school.

The chart below shows that nearly 45% of students graduating from Southern MN High schools leave the region to attend post secondary education.


ps.same.pr.total <- master.9 %>%
  filter(attended.ps == "Yes") %>%
  group_by(ps.in.same.pr) %>%
  summarize(n = n()) %>%
  ungroup() %>%
  mutate(pct = n / sum(n),
         data_id = as.character(seq(n())))

ps.same.pr.total.plot <- ggplot(ps.same.pr.total, aes(fct_rev(fct_reorder(ps.in.same.pr, pct)), pct)) +
    geom_col_interactive(aes(data_id = data_id, tooltip = paste("College planning region location: ", ps.in.same.pr, "\nNumber of students: ", comma(n, accuracy = 1), "\nPercent of students: ", percent(pct, accuracy = .1), sep = ""))) +
    geom_label(aes(label = percent(pct, accuracy = .1)), show.legend = FALSE, color = "black", size = 5) +
    labs(x="", y = "", color="", title = "Percent of students attending college within\nsame planning region")+
    scale_y_continuous(labels=scales::percent)+
    theme_bar+
    theme(legend.position = "none",
          text = element_text(size = 16),
          axis.text.x = element_text(angle = 25, hjust = .9))


girafe(ggobj = ps.same.pr.total.plot) %>%
  girafe_options(opts_selection(type = "none"),
                 opts_sizing(rescale = FALSE))


Summary of attending college in Minnesota

Now we want to see how many students stay or leave Minnesota to attend college.

The chart below shows that 62% of students graduating from Southern MN high schools attend a post secondary institution in Minnesota.


ps.in.MN.total <- master.9 %>%
  filter(attended.ps == "Yes") %>%
  group_by(ps.in.MN) %>%
  summarize(n = n()) %>%
  ungroup() %>%
  mutate(pct = n / sum(n),
         data_id = as.character(seq(n())))

ps.in.MN.total.plot <- ggplot(ps.in.MN.total, aes(fct_rev(fct_reorder(ps.in.MN, pct)), pct)) +
    geom_col_interactive(aes(data_id = data_id, tooltip = paste("College planning region location: ", ps.in.MN, "\nNumber of students: ", comma(n, accuracy = 1), "\nPercent of students: ", percent(pct, accuracy = .1), sep = ""))) +
    geom_label(aes(label = percent(pct, accuracy = .1)), show.legend = FALSE, color = "black", size = 5) +
    labs(x="", y = "", color="", title = "Percent of students attending college in\nMinnesota")+
    scale_y_continuous(labels=scales::percent)+
    scale_x_discrete(guide = guide_axis(n.dodge = 1)) +
    theme_bar+
    theme(legend.position = "none",
          text = element_text(size = 16),
          axis.text.x = element_text(angle = 25, hjust = .9))


girafe(ggobj = ps.in.MN.total.plot) %>%
  girafe_options(opts_selection(type = "none"),
                 opts_sizing(rescale = FALSE))


Next, lets check to see if the RUCA category of their high school is related to whether they attend a college inside or outside Minnesota.

The crosstabs below indicate that there is a relationship between the RUCA category of a student’s high school and whether they attend a college inside or outside of Minnesota. The p-value was near zero.

The differences are minimal.


ps.in.MN.ruca <- master.9 %>%
  filter(attended.ps == "Yes") %>%
  group_by(Dem_Desc, ps.in.MN) %>%
  summarize(n = n()) %>%
  ungroup() %>%
  group_by(Dem_Desc) %>%
  mutate(pct = n / sum(n)) %>%
  ungroup() %>%
  mutate(data_id = as.character(seq(n())))

ps.in.MN.ruca.plot <- ggplot(ps.in.MN.ruca, aes(ps.in.MN, pct, fill = Dem_Desc, group = Dem_Desc)) +
    geom_col_interactive(position = "dodge", aes(data_id = data_id, tooltip = paste(Dem_Desc, "\nCollege in Minnesota or not: ", ps.in.MN, "\nNumber of students: ", comma(n, accuracy = 1), "\nPercent of students: ", percent(pct, accuracy = .1), sep = ""))) +
    geom_label(position = position_dodge(width = .9), aes(label = percent(pct, accuracy = .1)), show.legend = FALSE, color = "black", size = 5) +
    labs(x="", y = "", color="", title = "Percent of students attending college in\nMinnesota")+
    scale_y_continuous(labels=scales::percent)+
    theme_bar+
    scale_fill_manual(values= color.ruca,
                       guide = guide_legend(ncol = 3)) +
    theme(legend.position = "bottom",
          text = element_text(size = 16),
          axis.text.x = element_text(angle = 25, hjust = .9))


girafe(ggobj = ps.in.MN.ruca.plot) %>%
  girafe_options(opts_selection(type = "none"),
                 opts_sizing(rescale = FALSE))
ps.in.MN.ct.ruca <- master.9 %>%
  filter(attended.ps == "Yes")

CrossTable(ps.in.MN.ct.ruca$Dem_Desc, ps.in.MN.ct.ruca$ps.in.MN, expected = TRUE, prop.t = FALSE, prop.c = FALSE, prop.chisq = FALSE)

 
   Cell Contents
|-------------------------|
|                       N |
|              Expected N |
|           N / Row Total |
|-------------------------|

 
Total Observations in Table:  88875 

 
                          | ps.in.MN.ct.ruca$ps.in.MN 
ps.in.MN.ct.ruca$Dem_Desc |                 In MN | Inside and outside MN |            Outside MN |             Row Total | 
--------------------------|-----------------------|-----------------------|-----------------------|-----------------------|
           Town/rural mix |                  9693 |                  2430 |                  3678 |                 15801 | 
                          |              9815.377 |              2465.934 |              3519.689 |                       | 
                          |                 0.613 |                 0.154 |                 0.233 |                 0.178 | 
--------------------------|-----------------------|-----------------------|-----------------------|-----------------------|
     Urban/town/rural mix |                 26232 |                  6433 |                  9159 |                 41824 | 
                          |             25980.528 |              6527.132 |              9316.340 |                       | 
                          |                 0.627 |                 0.154 |                 0.219 |                 0.471 | 
--------------------------|-----------------------|-----------------------|-----------------------|-----------------------|
           Entirely urban |                 19283 |                  5007 |                  6960 |                 31250 | 
                          |             19412.096 |              4876.934 |              6960.970 |                       | 
                          |                 0.617 |                 0.160 |                 0.223 |                 0.352 | 
--------------------------|-----------------------|-----------------------|-----------------------|-----------------------|
             Column Total |                 55208 |                 13870 |                 19797 |                 88875 | 
--------------------------|-----------------------|-----------------------|-----------------------|-----------------------|

 
Statistics for All Table Factors


Pearson's Chi-squared test 
------------------------------------------------------------
Chi^2 =  19.94633     d.f. =  4     p =  0.0005117295 


 


Next, lets check to see if the EDR of their high school is related to whether they attend a college inside or outside Minnesota.

The crosstabs below indicate that there is a relationship between the EDR category of a student’s high school and whether they attend a college inside or outside of Minnesota. The p-value was near zero.

EDR 9 has a signifcantly higher percentage of graduates attend college in Minnesota. EDR 9 had 67% of their graduates attend college in Minnesota while 60% of EDR 10 attended in Minnesota.


ps.in.MN.edr <- master.9 %>%
  filter(attended.ps == "Yes") %>%
  group_by(edr, ps.in.MN) %>%
  summarize(n = n()) %>%
  ungroup() %>%
  group_by(edr) %>%
  mutate(pct = n / sum(n)) %>%
  ungroup() %>%
  mutate(data_id = as.character(seq(n())))

ps.in.MN.edr.plot <- ggplot(ps.in.MN.edr, aes(ps.in.MN, pct, fill = edr, group = edr)) +
    geom_col_interactive(position = "dodge", aes(data_id = data_id, tooltip = paste(edr, "\nCollege in Minnesota or not: ", ps.in.MN, "\nNumber of students: ", comma(n, accuracy = 1), "\nPercent of students: ", percent(pct, accuracy = .1), sep = ""))) +
    geom_label(position = position_dodge(width = .9), aes(label = percent(pct, accuracy = .1)), show.legend = FALSE, color = "black", size = 5) +
    labs(x="", y = "", color="", title = "Percent of students attending college in\nMinnesota")+
    scale_y_continuous(labels=scales::percent)+
    theme_bar+
    scale_fill_manual(values= color.edr,
                       guide = guide_legend(ncol = 3)) +
    theme(legend.position = "bottom",
          text = element_text(size = 16),
          axis.text.x = element_text(angle = 25, hjust = .9))


girafe(ggobj = ps.in.MN.edr.plot) %>%
  girafe_options(opts_selection(type = "none"),
                 opts_sizing(rescale = FALSE))
ps.in.MN.ct.edr <- master.9 %>%
  filter(attended.ps == "Yes")

CrossTable(ps.in.MN.ct.edr$edr, ps.in.MN.ct.edr$ps.in.MN, expected = TRUE, prop.t = FALSE, prop.c = FALSE, prop.chisq = FALSE)

 
   Cell Contents
|-------------------------|
|                       N |
|              Expected N |
|           N / Row Total |
|-------------------------|

 
Total Observations in Table:  88875 

 
                      | ps.in.MN.ct.edr$ps.in.MN 
  ps.in.MN.ct.edr$edr |                 In MN | Inside and outside MN |            Outside MN |             Row Total | 
----------------------|-----------------------|-----------------------|-----------------------|-----------------------|
   EDR 10 - Southeast |                 38120 |                 10198 |                 15052 |                 63370 | 
                      |             39364.624 |              9889.642 |             14115.734 |                       | 
                      |                 0.602 |                 0.161 |                 0.238 |                 0.713 | 
----------------------|-----------------------|-----------------------|-----------------------|-----------------------|
EDR 9 - South Central |                 17088 |                  3672 |                  4745 |                 25505 | 
                      |             15843.376 |              3980.358 |              5681.266 |                       | 
                      |                 0.670 |                 0.144 |                 0.186 |                 0.287 | 
----------------------|-----------------------|-----------------------|-----------------------|-----------------------|
         Column Total |                 55208 |                 13870 |                 19797 |                 88875 | 
----------------------|-----------------------|-----------------------|-----------------------|-----------------------|

 
Statistics for All Table Factors


Pearson's Chi-squared test 
------------------------------------------------------------
Chi^2 =  387.0265     d.f. =  2     p =  9.083746e-85 


 


What states are students going to college

The last piece here is to provide a map of the United States to see where students are going for college.

Outside of Minnesota, Southern MN graduates attend colleges in Wisconsin at the highest rate - 10.5%. This is followed by Iowa at 5.5%.


us_boundary <- read_sf('/Users/kellyasche/Library/CloudStorage/GoogleDrive-kasche@ruralmn.org/My Drive/Data Prep/R Projects/Shapefiles/US shapefiles/cb_2018_us_state_500k.shp', quiet = TRUE) %>%
  tigris::shift_geometry()
  
ps.state <- nsc.enrollment.location.join %>%
  select(PersonID, ps.statefp) %>%
  distinct(PersonID, ps.statefp) %>%
  group_by(ps.statefp) %>%
  summarize(n = n()) %>%
  ungroup() %>%
  mutate(pct = n / sum(n),
         pct.bins = cut(pct,
                        breaks = c(-1, 0, .01, .05, .10, 1),
                        labels = c("0", "0% - 1%", "1% - 5%", "5% - 10%", "More than 10%"))) %>% 
  right_join(us_boundary[,c(1,6,10)], by = c("ps.statefp" = "STATEFP")) %>%
  filter(!ps.statefp %in% c("60", "66", "69", "72", "78"))


ps.state.map <- ggplot(ps.state) +
  geom_sf_interactive(color = "grey85", aes(geometry = geometry, fill = pct.bins, data_id = NAME, tooltip = paste(NAME, "\nNumber of students enrolled in college from SW MN: ", comma(n, accuracy = 1), "\nPercent of students: ", percent(pct, accuracy = .1), sep = ""))) +
  coord_sf(crs = 'ESRI:102003') + 
  theme_sf+
  scale_fill_manual(values = brewer.pal(n = 4, "PuBu")) +
  labs(title = "Percent of students attending colleges in states") +
  theme(text = element_text(size = 18))

girafe(ggobj = ps.state.map, height_svg = 10, width_svg = 10) %>%
  girafe_options(opts_sizing(rescale = FALSE))


write_csv(master.9, "Data/SLEDS/Masters/Master-9.csv")